Position modeling plays a critical role in Transformers. In this paper, we focus on length extrapolation, i.e., training on short texts while evaluating longer sequences. We define attention resolution as an indicator of extrapolation. We then propose two designs that improve this metric for Transformers. Specifically, we introduce a relative position embedding that explicitly maximizes attention resolution. Moreover, we use blockwise causal attention during inference for better resolution. We evaluate different Transformer variants on language modeling. Experimental results show that our model achieves strong performance in both interpolation and extrapolation settings. The code will be available at https://aka.ms/LeX-Transformer.
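The abstract does not detail the blockwise scheme, but a minimal sketch of one plausible reading — each query attends causally within its own block and to the entire previous block, with `block_size` as our own assumed parameter — could look like this:

```python
import numpy as np

def blockwise_causal_mask(seq_len: int, block_size: int) -> np.ndarray:
    """Boolean mask where True means 'may attend'.

    Each query attends to tokens in its own block and the immediately
    preceding block, subject to the usual causal (no-future) constraint.
    """
    pos = np.arange(seq_len)
    q_block = pos[:, None] // block_size   # block index of each query
    k_block = pos[None, :] // block_size   # block index of each key
    causal = pos[None, :] <= pos[:, None]  # no attending to future tokens
    near = (q_block - k_block) <= 1        # own block or the previous one
    return causal & near

print(blockwise_causal_mask(seq_len=8, block_size=4).astype(int))
```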
Personalized chatbots focus on endowing chatbots with a consistent personality so that they behave like real users and can further act as personal assistants. Previous studies have explored generating implicit user profiles from a user's dialogue history for building personalized chatbots. However, these studies train the entire model using only the response generation loss, making them prone to data sparsity. Moreover, they overemphasize the quality of the final generated response while ignoring the correlations within, and fusion of, the user's dialogue history, leading to coarse data representations and degraded performance. To tackle these problems, we propose MCP, a self-supervised learning framework that captures better representations from a user's dialogue history for personalized chatbots. Specifically, we apply contrastive sampling methods to leverage the supervision signals hidden in user dialogue history and generate pre-training samples that enhance the model. We design three pre-training tasks based on three types of contrastive pairs drawn from user dialogue history, namely response pairs, sequence augmentation pairs, and user pairs. We pre-train the utterance encoder and the history encoder towards these contrastive objectives and use the pre-trained encoders to generate user profiles during personalized response generation. Experimental results on two real-world datasets show that our proposed model MCP significantly improves over existing methods.
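The abstract leaves the contrastive objective unspecified; a standard InfoNCE loss with in-batch negatives is a reasonable stand-in for training on any of the three pair types (all names and the temperature below are ours):

```python
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor, temperature: float = 0.1):
    """InfoNCE loss with in-batch negatives.

    anchor, positive: [batch, dim] encodings of the two views of a
    contrastive pair (e.g. two responses by the same user, two augmented
    history sequences, or two users' histories).
    """
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature      # [batch, batch] pairwise similarity
    labels = torch.arange(a.size(0))      # i-th anchor matches i-th positive
    return F.cross_entropy(logits, labels)

# toy usage: 4 pairs of 16-d encoder outputs
loss = info_nce(torch.randn(4, 16), torch.randn(4, 16))
print(loss.item())
```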
Background: Cervical cancer seriously affects the health of the female reproductive system. Optical coherence tomography (OCT) has emerged as a non-invasive, high-resolution imaging technology for cervical disease detection. However, OCT image annotation is knowledge-intensive and time-consuming, which hinders the training of deep-learning-based classification models. Objective: This study aims to develop a computer-aided diagnosis (CADx) approach based on self-supervised learning to classify in-vivo cervical OCT images. Methods: In addition to high-level semantic features extracted by a convolutional neural network (CNN), the proposed CADx approach leverages unlabeled cervical OCT images via contrastive texture learning to obtain texture features. We conducted ten-fold cross-validation on an OCT image dataset from a multi-center clinical study of 733 patients in China. Results: In the binary classification task of detecting high-risk diseases, including high-grade squamous intraepithelial lesions and cervical cancer, our method achieved an area-under-the-curve value of 0.9798 ± 0.0157, with a sensitivity of 91.17 ± 4.99% and a specificity of 93.96 ± 4.72% on OCT image patches; it also outperformed two of the four medical experts on the test set. Furthermore, using a cross-shaped threshold voting strategy, our method achieved a sensitivity of 91.53% and a specificity of 97.37% on 118 Chinese patients. Conclusion: The proposed contrastive-learning-based CADx method outperforms the end-to-end CNN model and, by building on texture features, provides better interpretability, showing great potential for use in "see-and-treat" clinical protocols.
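The exact cross-shaped threshold voting rule is not given here; as an illustration only, a generic two-threshold patch-to-patient vote (with hypothetical threshold values) might look like:

```python
import numpy as np

def patient_level_vote(patch_probs: np.ndarray, patch_threshold: float = 0.5,
                       vote_threshold: float = 0.5) -> int:
    """Aggregate patch-level probabilities into a patient-level call.

    Generic sketch of threshold voting (not the paper's cross-shaped
    sampling): a patch votes 'high risk' when its probability exceeds
    patch_threshold, and the patient is flagged when the fraction of
    positive votes exceeds vote_threshold.
    """
    votes = patch_probs >= patch_threshold
    return int(votes.mean() >= vote_threshold)

print(patient_level_vote(np.array([0.9, 0.8, 0.3, 0.7, 0.6])))  # -> 1
```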
This paper presents a novel framework for planning in unknown and occluded urban spaces. We specifically focus on turns and intersections, where occlusions significantly impact navigability. Our approach uses an inpainting model to fill in a sparse, occluded, semantic lidar point cloud and plans dynamically feasible paths for a vehicle to traverse through the open and inpainted spaces. We demonstrate our approach on a car's lidar data with real-time occlusions and show that, by inpainting occluded areas, we can plan longer paths with more turn options than without inpainting; in addition, our approach follows paths derived from a planner with no occlusions (the ground truth) more closely than other state-of-the-art approaches.
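As a toy sketch of the pipeline — with a nearest-label fill standing in for the learned inpainting model and breadth-first search standing in for the dynamically feasible planner, both our own simplifications — occluded cells can be filled before planning:

```python
import numpy as np
from collections import deque

def inpaint_unknown(grid: np.ndarray) -> np.ndarray:
    """Placeholder for the learned inpainting model: fill unknown (-1)
    cells with the most common known label among 4-neighbors, iterating
    until no unknown cells remain (assumes at least one known cell)."""
    g = grid.copy()
    while (g == -1).any():
        for r, c in zip(*np.where(g == -1)):
            nb = [g[r + dr, c + dc] for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                  if 0 <= r + dr < g.shape[0] and 0 <= c + dc < g.shape[1]
                  and g[r + dr, c + dc] != -1]
            if nb:
                g[r, c] = max(set(nb), key=nb.count)
    return g

def shortest_path_len(free: np.ndarray, start, goal):
    """BFS stand-in for the dynamically feasible planner."""
    q, seen = deque([(start, 0)]), {start}
    while q:
        (r, c), d = q.popleft()
        if (r, c) == goal:
            return d
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            n = (r + dr, c + dc)
            if (0 <= n[0] < free.shape[0] and 0 <= n[1] < free.shape[1]
                    and free[n] and n not in seen):
                seen.add(n)
                q.append((n, d + 1))
    return None

# toy grid: 0 = free, 1 = occupied, -1 = occluded/unknown
grid = np.array([[0, 0, -1], [1, 1, -1], [0, 0, 0]])
filled = inpaint_unknown(grid)
print(shortest_path_len(filled == 0, (0, 0), (2, 2)))  # path found through inpainted cells
```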
Large pretrained language models have shown surprising In-Context Learning (ICL) ability. With a few demonstration input-label pairs, they can predict the label for an unseen input without additional parameter updates. Despite its great empirical success, the working mechanism of ICL remains an open problem. To better understand how ICL works, this paper explains language models as meta-optimizers and understands ICL as a kind of implicit finetuning. Theoretically, we show that Transformer attention has a dual form of gradient-descent-based optimization. On top of this, we understand ICL as follows: GPT first produces meta-gradients according to the demonstration examples, and these meta-gradients are then applied to the original GPT to build an ICL model. Experimentally, we comprehensively compare the behavior of ICL and explicit finetuning on real tasks to provide empirical evidence that supports our understanding. The results show that ICL behaves similarly to explicit finetuning at the prediction level, the representation level, and the attention-behavior level. Further, inspired by our understanding of meta-optimization, we design a momentum-based attention by analogy with the momentum-based gradient descent algorithm. Its consistently better performance over vanilla attention supports our understanding from yet another aspect and, more importantly, shows the potential of utilizing our understanding for future model design.
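The dual form is exact for linear (unnormalized) attention, which the paper uses as a relaxation of softmax attention; a few lines of NumPy verify the identity that attending to demonstrations equals applying an outer-product weight update:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 6                      # head dim, number of demonstration tokens
K = rng.normal(size=(n, d))      # demonstration keys
V = rng.normal(size=(n, d))      # demonstration values
q = rng.normal(size=d)           # query from the test token

# Attention view: read the demonstrations out directly.
attn_out = V.T @ (K @ q)         # sum_i v_i * (k_i . q), i.e. linear attention

# Meta-optimization view: the same demonstrations act as a weight
# update delta_W = sum_i v_i k_i^T, as if one gradient-descent step
# had been taken, which is then applied to the query.
delta_W = V.T @ K
gd_out = delta_W @ q

print(np.allclose(attn_out, gd_out))  # True: the two views coincide
```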
Large language models have exhibited intriguing in-context learning capability, achieving promising zero- and few-shot performance without updating the parameters. However, conventional in-context learning is usually restricted by length constraints, making it unable to absorb supervision from a large number of examples. To go beyond few shots, we introduce structured prompting, which breaks the length limit and scales in-context learning to thousands of examples. Specifically, demonstration examples are separately encoded with well-designed position embeddings and are then jointly attended to by the test example using a rescaled attention mechanism. This lets us scale the number of exemplars with linear rather than quadratic complexity with respect to length. Experimental results on a diverse set of tasks show that our approach improves end-task performance and reduces evaluation variance over conventional in-context learning as the number of demonstration examples increases. Code has been released at https://aka.ms/structured-prompting.
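Sketching the rescaled attention: each group is encoded independently (sharing position ids), and the test query attends over the union of all groups plus its own context. Dividing each group's unnormalized weights by the number of groups M is our assumption for the rescaling factor; the paper's exact factor may differ:

```python
import numpy as np

def rescaled_attention(q, group_keys, group_values, self_keys, self_values):
    """Sketch of attention for structured prompting.

    Each demonstration group was encoded independently, so its
    keys/values are position-aligned. Each group's unnormalized weights
    are divided by M so that adding more groups does not swamp the test
    context (an assumed rescaling, for illustration only).
    """
    M = len(group_keys)
    exp_scores, values = [], []
    for K, V in zip(group_keys, group_values):
        exp_scores.append(np.exp(K @ q) / M)
        values.append(V)
    exp_scores.append(np.exp(self_keys @ q))   # the test example's own context
    values.append(self_values)
    w = np.concatenate(exp_scores)
    return (w / w.sum()) @ np.concatenate(values)

rng = np.random.default_rng(0)
d = 8
groups_k = [rng.normal(size=(5, d)) for _ in range(3)]   # 3 groups, 5 tokens each
groups_v = [rng.normal(size=(5, d)) for _ in range(3)]
out = rescaled_attention(rng.normal(size=d), groups_k, groups_v,
                         rng.normal(size=(4, d)), rng.normal(size=(4, d)))
print(out.shape)  # (8,)
```

Because each group is encoded on its own, encoding cost grows linearly in the number of groups rather than quadratically in total prompt length.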
It is crucial to choose the appropriate scale in order to build an effective and informative representation of a complex system. Scientists carefully choose the scales of their experiments to extract the variables that describe the causal relations in the system. They have found that the coarse scale (macro) is sometimes more causal and informative than numerous-parameter observations (micro). The phenomenon that causality emerges through coarse-graining is called Causal Emergence (CE). Based on information theory, a number of recent works have quantitatively shown that CE indeed happens when coarse-graining a micro model into a macro one. However, existing works have not discussed why and when CE happens. We quantitatively analyze the redistribution of uncertainty under coarse-graining and suggest that this redistribution is the cause of causal emergence. We further analyze the thresholds that determine whether CE happens or not. From the regularity of the transition probability matrix (TPM) of discrete systems, we derive mathematical expressions for the model properties and compute the threshold values for different operations. The results provide critical and specific conditions for CE, offering helpful guidance for choosing a proper coarse-graining operation. They also provide a new way to better understand the nature of causality and causal emergence.
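CE is typically measured with effective information (EI) under a uniform intervention; the classic four-state example below (our choice, following earlier CE literature rather than this paper) shows EI rising from about 0.81 bits at the micro scale to 1 bit after coarse-graining:

```python
import numpy as np

def effective_information(tpm: np.ndarray) -> float:
    """EI of a TPM in bits: mutual information between the current state
    (set to the uniform 'intervention' distribution) and the next state."""
    avg = tpm.mean(axis=0)  # effect distribution under do(uniform)
    kl = lambda p, q: np.sum(p[p > 0] * np.log2(p[p > 0] / q[p > 0]))
    return float(np.mean([kl(row, avg) for row in tpm]))

def coarse_grain(tpm: np.ndarray, groups) -> np.ndarray:
    """Macro TPM: start uniformly inside a group, sum outgoing mass per group."""
    macro = np.zeros((len(groups), len(groups)))
    for i, gi in enumerate(groups):
        row = tpm[gi].mean(axis=0)
        for j, gj in enumerate(groups):
            macro[i, j] = row[gj].sum()
    return macro

# Classic example: states 0-2 hop uniformly among themselves,
# state 3 is absorbing. Grouping {0,1,2} vs {3} removes micro noise.
micro = np.array([[1/3, 1/3, 1/3, 0],
                  [1/3, 1/3, 1/3, 0],
                  [1/3, 1/3, 1/3, 0],
                  [0,   0,   0,   1]])
macro = coarse_grain(micro, [[0, 1, 2], [3]])
print(effective_information(micro))   # ~0.81 bits
print(effective_information(macro))   # 1.0 bit -> causal emergence
```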
Tourette Syndrome (TS) is a behavior disorder with onset in childhood, characterized by the expression of involuntary movements and sounds commonly referred to as tics. Behavioral therapy is the first-line treatment for patients with TS; it helps patients raise awareness of tic occurrence and develop tic-inhibition strategies. However, the limited availability of therapists and the difficulty of in-home follow-up limit its effectiveness. An automatic tic detection system that is easy to deploy could ease home therapy by providing feedback to patients while they exercise tic awareness. In this work, we propose a novel architecture (T-Net) for automatic tic detection and classification from untrimmed videos. T-Net combines temporal detection and segmentation and operates on features that are interpretable to a clinician. We compare T-Net to several state-of-the-art systems working on deep features extracted from raw videos; T-Net achieves comparable average precision while relying on the interpretable features needed in clinical practice.
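T-Net's detection head is not described here; purely as an illustration of the temporal-detection step, frame-level tic probabilities can be post-processed into segments with a generic threshold-and-merge rule (thresholds are hypothetical):

```python
import numpy as np

def segments_from_frame_scores(scores: np.ndarray, threshold: float = 0.5,
                               min_len: int = 3):
    """Turn per-frame tic probabilities into (start, end) segments.

    A generic post-processing step for temporal detection (not T-Net's
    exact head): frames above threshold are grouped into runs, and runs
    shorter than min_len frames are discarded as noise.
    """
    active = scores >= threshold
    segments, start = [], None
    for t, a in enumerate(active):
        if a and start is None:
            start = t
        elif not a and start is not None:
            if t - start >= min_len:
                segments.append((start, t))
            start = None
    if start is not None and len(active) - start >= min_len:
        segments.append((start, len(active)))
    return segments

print(segments_from_frame_scores(
    np.array([.1, .8, .9, .7, .2, .6, .9, .9, .8, .1])))  # [(1, 4), (5, 9)]
```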
Tensor robust principal component analysis (TRPCA), which aims to recover a low-rank tensor corrupted by sparse noise, has attracted much attention in many real applications. This paper develops a new global weighted TRPCA method (GWTRPCA), which is the first to simultaneously consider the importance of intra-frontal-slice and inter-frontal-slice singular values. Exploiting this global information, GWTRPCA assigns smaller weights to larger singular values, penalizing them less. Hence, our method can recover the low-tubal-rank component more exactly. Moreover, since the weight setting plays a crucial role in the success of GWTRPCA, we propose an effective adaptive weight-learning strategy via a modified Cauchy estimator (MCE). To implement the GWTRPCA method, we design an optimization algorithm based on the alternating direction method of multipliers (ADMM). Experiments on real-world datasets validate the effectiveness of our proposed method.
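GWTRPCA operates in the tensor-SVD domain with globally learned weights; for illustration, here is the weighted singular-value thresholding building block on a single frontal slice, with an assumed Cauchy-style weighting rule rather than the paper's exact MCE:

```python
import numpy as np

def weighted_svt(M: np.ndarray, tau: float, c: float = 1.0) -> np.ndarray:
    """Weighted singular-value thresholding on one (frontal-slice) matrix.

    Larger singular values receive smaller weights and are therefore
    shrunk less, preserving the dominant low-rank structure. GWTRPCA
    applies this idea tensor-wide with globally shared weights; this
    single-matrix version only illustrates the proximal step inside ADMM.
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    w = c / (s + c)                        # assumed Cauchy-like weighting, not the paper's MCE
    s_shrunk = np.maximum(s - tau * w, 0)  # shrink, clamping at zero
    return (U * s_shrunk) @ Vt

X = np.random.default_rng(0).normal(size=(6, 6))
print(np.linalg.matrix_rank(weighted_svt(X, tau=2.0)))  # rank drops after shrinkage
```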
Out-of-Domain (OOD) detection is a key component of task-oriented dialogue systems, aiming to identify whether a query falls outside the predefined set of supported intents. Previous softmax-based detection algorithms have proved to be overconfident on OOD samples. In this paper, we analyze overconfident OOD as stemming from distributional uncertainty caused by the mismatch between training and test distributions, which makes the model unable to make confident predictions and can thus produce abnormal softmax scores. We propose a Bayesian OOD detection framework that calibrates distributional uncertainty using Monte-Carlo dropout. Our method is flexible, can easily be plugged into existing softmax-based baselines, and gains a 33.33% OOD F1 improvement over MSP with only a 0.41% increase in inference time. Further analyses demonstrate the effectiveness of Bayesian learning for OOD detection.
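A minimal sketch of the Monte-Carlo dropout calibration, assuming a classifier with dropout layers (the model and OOD score below are ours, not the paper's exact setup):

```python
import torch
import torch.nn as nn

def mc_dropout_probs(model: nn.Module, x: torch.Tensor, passes: int = 20):
    """Average softmax over stochastic forward passes (Monte-Carlo dropout).

    Dropout stays active at inference, so the spread across passes
    reflects distributional uncertainty; the averaged probabilities are
    better calibrated than a single deterministic softmax (MSP).
    """
    model.train()                      # keep dropout layers stochastic
    with torch.no_grad():
        probs = torch.stack([model(x).softmax(dim=-1) for _ in range(passes)])
    return probs.mean(dim=0)

# toy intent classifier with dropout; OOD score = 1 - max averaged prob
clf = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Dropout(0.3), nn.Linear(32, 5))
p = mc_dropout_probs(clf, torch.randn(2, 16))
print(1 - p.max(dim=-1).values)        # higher score -> more likely OOD
```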